Models for Socio-Environmental Data

Multilevel Regression

May 29, 2019

Preliminaries


Motivation

Each section of this lab has two parts– a model building exercise and a model coding exercise. The material covered here is important and broadly useful – building multi-levels models is a true workhorse for understanding ecological processes because so many problems contain information at nested spatial scales, levels of organization, or categories. It will be worthwhile to dig in deeply to understand it. The big picture is to demonstrate the flexibility that you gain as a modeler by understanding basic principles of Bayesian analysis. To accomplish that, these exercises will reinforce the following:

  1. Diagramming and writing hierarchical models
  2. Using data to model parameters
  3. JAGS coding
  4. Creating index variables, a critically important and useful skill
  5. Posterior predictive checks



Introduction

Ecological data are often collected at multiple scales or levels of organization in nested designs. Group is a catchall term for the upper level in many different types of nested hierarchies. Groups could logically be composed of populations, locations, species, treatments, life stages, and individual studies, or really, any sensible category. We have measurements within groups on individual organisms, plots, species, time periods, and so on. We may also have measurements on the groups themselves, that is, covariates that apply at the upper level of organization or spatial scale or the category that contains the measurements. Multilevel models represent the way that a quantity of interest responds to the combined influence of observations taken at the group level and within the group.

Nitrous oxide N2O, a greenhouse gas roughly 300 times more potent than carbon dioxide in forcing atmospheric warming, is emitted when synthetic nitrogenous fertilizers are added to soils. Qian and colleagues (2010) conducted a Bayesian meta-analysis of N2O emissions (g N \(\cdot\) ha-1 \(\cdot\) d-1) from agricultural soils using data from a study conducted by Carey (2007), who reviewed 164 relevant studies. Studies occurred at different locations, forming a group-level hierarchy (we will use only sites that have both nitrogen and carbon data, which reduces the number of sites to 107 in the analysis here). Soil carbon content (g \(\cdot\) organic C \(\cdot\) g-1 soil dry matter) was measured as a group-level covariate and is assumed to be measured without error. Observations of N2O emission are also assumed to be measured without error and were paired with measurements of fertilizer addition (kg N\(\cdot\) ha-1 \(\cdot\) year-1). The effect of different types of fertilizer was also studied.

You are going to use these data to build increasingly complex models of N2O emission. The initial models will ignore some important covariates as well as how the data are structured hierarchically into sites. This is ok! When writing for a multi-level model like this one, do it incrementally, starting with a separate model for each site (the no-pool model) or a model that ignores sites entirely (the pooled model). After getting these models to work you can add complexity by drawing the intercept for each model from a distribution, before pursuing further refinements. We strongly sugggest this approach because it is always best to do the simple thing first: there is less to go wrong. Also, when things do go wrong it will be clearer as to what is causing the problem.


Pooled


Diagramming and writing the model

Let’s begin by ignoring the data on soil carbon, site, and fertilizer type so that all observations are drawn from a single pool. This is what’s known as complete pooling (see Gelman and Hill, 2007), or just a pooled model. You will use a linearized power function for your deterministic model of emissions as a function of nitrogen input:

\[ \begin{aligned} \mu_{i} & = \gamma x_{i}^{\beta}\\ \alpha & = \log\big(\gamma\big)\\ \log\big(\mu_{i}\big) & = \alpha+\beta\big(\log(x_i)\big)\\ g\big(\alpha,\beta,\log(x_i)\big) & = \alpha+\beta\big(\log(x_i)\big) \\ \end{aligned} \]


  1. Interpret the coefficients \(\alpha\), \(\beta\), and \(\gamma\) in this model.


  1. Draw a Bayesian network for a single observation from this model.


  1. Write out the joint distribution for a linear regression model of N2O emission (\(y\)) on fertilizer addition (\(x\)) for a single observation. Start by using generic \([\,]\). Use \(\sigma^{2}\) to represent the uncertainty in your model realizing, of course, that you might need moment matching when you choose a specific distribution.


  1. Finish by choosing specific distributions for likelihoods and priors. You will use the math in the answer as a template to code your model in the subsequent exercises. What are assuming about the distribution of the untransformed \(\mu_i\)?


  1. What is the hypothesis represented by this model?



Visualizing the data

You need to load the ggplot2, gridExtra, rjags, MCMCVis, SESYNCBayes and HDInterval libraries. Set the seed to 10 to compare your answers to ours. It is always a good idea to look at the data. Examine the head of the data frame for emissions. Note that the columns group.index and fert.index contain indices for sites and fertilizer types. We are going to ignore these for now since the pooled model does not take these into account. Use the code below to plot N2O emissions as a function of fertilizer input for both the logged and unlogged data.

head(N2OEmission)
##   fertilizer group carbon n.input emission reps group.index fert.index
## 1          A    14    2.7     180    0.620   13          10          2
## 2          A    14    4.6     180    0.450   13          10          2
## 3          A    11    0.9     112    0.230   12           7          2
## 4          A    38    0.5     100    0.153   14          29          2
## 5          A     1    4.0     250    1.000    6           1          2
## 6          A    38    0.5     100    0.216   14          29          2
g1 <- ggplot(data = N2OEmission) +
  geom_point(aes(y = emission, x = n.input), alpha = 3/10, shape = 21, colour = "black", 
    fill = "brown", size = 3) +
  theme_minimal()
g2 <- ggplot(data = N2OEmission) +
  geom_point(aes(y = log(emission), x = log(n.input)), alpha = 3/10, shape = 21, colour = "black", 
    fill = "brown", size = 3) +
  theme_minimal() 
gridExtra::grid.arrange(g1, g2, nrow = 1)



Fitting the model with JAGS

You will now write a simple, pooled model where you gloss over differences in sites and fertilizer types and lump everything into a set of \(x\) and \(y\) pairs using the R template provided below. It is imperative that you study the data statement and match the variable names in your JAGS code to the left hand side of the = in the data list. Call the intercept alpha, the slope beta and use sigma to name the standard deviation in the likelihood. Also notice, that we center the nitrogen input covariate to speed convergence. You could also standardize this as well.

In addition to fitting this model, we would like you to have JAGS predict the mean logged emission response to nitrogen input and the median unlogged emission response (Why median? Hint: think back to the distribution of the untransformed data above in question 3 above). To help you out we have provided the range of N2O values to predict over as the third element in the data list. Make sure you understand how we chose these values.

Note that in this problem and the ones that follow we have set up the data and the initial conditions for you. This will save time and frustration, allowing you to concentrate on writing code for the model but you must pay attention to the names we give in the data and inits lists. These must agree with the variable names in your model. Please see any of the course instructors if there is anything that you don’t understand about these lists.

n.input.pred <- seq(min(N2OEmission$n.input), max(N2OEmission$n.input), 10)

data = list(
  log.emission = log(N2OEmission$emission),
  log.n.input.centered = log(N2OEmission$n.input) - mean(log(N2OEmission$n.input)),
  log.n.input.centered.pred = log(n.input.pred) - mean(log(N2OEmission$n.input)))

inits = list(
  list(alpha = 0, beta = .5, sigma = 50),
  list(alpha = 1, beta = 1.5, sigma = 10),
  list(alpha = 2, beta = .75, sigma = 20))
  1. Write the code for the model. Compile the model and execute the MCMC to produce a coda object. Produce trace plots of the chains for model parameters. Produce a summary table and caterpillar plot for the parameters and tests for convergence including the effective sample size.



Visualizing model predictions

Let’s overlay the predicted mean logged emission response and the predicted median unlogged emission response from the pooled model on the raw data. We summarize the predictions using MCMCpstr() (think back to the JAGSPrimer) and then plot the median of the posterior distribution as a black line with geom_line() and the 95% credible intervals as a yellow shaded region using the geom_ribbon() function. Note that we need to back transform the x-values used for prediction. This is done so that the predicted values line up properly on the plot. Again, we will provide you with the code to do this to save time. You will need to modify this code in the future to make similar plots for models you fit in later sections.

pred <- MCMCpstr(zc.pooled, params = c("mu_pred", "log_mu_pred"), func = function(x) quantile(x, c(.025, .5, .975)))
mu.pred <- cbind(n.input.pred, data.frame(pred$mu_pred))
log.mu.pred <- cbind(log.n.input.pred = log(n.input.pred), data.frame(pred$log_mu_pred))
g3 <- g1 +
  geom_line(data = mu.pred, aes(x = n.input.pred, y = X50.)) +
  geom_ribbon(data = mu.pred, aes(x = n.input.pred, ymin = X2.5., ymax = X97.5.), alpha = 0.2, fill = "yellow")

g4 <- g2 +
  geom_line(data = log.mu.pred, aes(x = log.n.input.pred, y = X50.)) +
  geom_ribbon(data = log.mu.pred, aes(x = log.n.input.pred, ymin = X2.5., ymax = X97.5.), alpha = 0.2, fill = "yellow")

gridExtra::grid.arrange(g3, g4, nrow = 1)


No-Pooled


Diagramming and writing the model

Great, you got the pooled model to fit and made some predictions from it. However, perhaps the idea of ignoring the site effects might is not sitting very well with you. So, let’s take this a step further by modeling the relationship between N2O emission and fertilizer input such that the intercept \(\alpha\) varies by site (we will again ignore the data on soil carbon and fertilizer type). This is the opposite of the pooled model where we completely ignored the effect of site. Here we treat the intercept from each site as independent.

\[ \begin{aligned} \mu_{ij} & = \gamma x_{ij}^{\beta}\\ \alpha_{j} & = \log\big(\gamma_{j}\big)\\ \log\big(\mu_{ij}\big) & = \alpha_{i}+\beta\big(\log(x_{ij})\big)\\ g\big(\alpha_{j},\beta,\log(x_{ij})\big) & = \alpha_{j}+\beta\big(\log(x_{ij})\big) \\ \end{aligned} \]


  1. Draw a Bayesian network and write out the posterior and joint distribution for a linear regression model of N2O emission (\(y\)) on fertilizer addition (\(x\)) for a single observation.


  1. For the joint distribution, start by using generic \([\,]\). Use \(\sigma^{2}\) to represent the uncertainty in your model realizing, of course, that you might need moment matching when you choose a specific distribution.


  1. Finish by choosing specific distributions for likelihoods and priors. You will use the math in the answer as a template to code your model in the subsequent exercises.


  1. What is the hypothesis represented by this model?



Visualizing the data

Let’s visualize the data again, but this time highlighting the role site plays in determining the relationship between N2O emission and nitrogen input. First, head() the data to see how groups are organized. You will use group.index to group the observations by site.

head(N2OEmission)
##   fertilizer group carbon n.input emission reps group.index fert.index
## 1          A    14    2.7     180    0.620   13          10          2
## 2          A    14    4.6     180    0.450   13          10          2
## 3          A    11    0.9     112    0.230   12           7          2
## 4          A    38    0.5     100    0.153   14          29          2
## 5          A     1    4.0     250    1.000    6           1          2
## 6          A    38    0.5     100    0.216   14          29          2

Use the code below to plot logged N2O emissions against logged fertilizer input. This is the same code as before except now we use the facet_wrap(~group.index) function to produce this plot for each site.

g2 + facet_wrap(~group.index)



Fitting the model with JAGS

You will now write a simple, no-pool model using the R template provided below. Again notice, that we center the nitrogen input covariate to speed convergence. You could also standardize this as well.

n.sites <- length(unique(N2OEmission$group.index))
n.input.pred <- seq(min(N2OEmission$n.input), max(N2OEmission$n.input), 10)

data = list(
  log.emission = log(N2OEmission$emission),
  log.n.input.centered = log(N2OEmission$n.input) - mean(log(N2OEmission$n.input)),
  log.n.input.centered.pred = log(n.input.pred) - mean(log(N2OEmission$n.input)),
  group = N2OEmission$group.index,
  n.sites = n.sites)

inits = list(
  list(alpha = rep(0, n.sites), beta = .5, sigma = 50),
  list(alpha = rep(1, n.sites), beta = 1.5, sigma = 10),
  list(alpha = rep(-1, n.sites), beta = .75, sigma = 20))
  1. Write the code for the model. Compile the model and execute the MCMC to produce a coda object. Produce trace plots of the chains for model parameters. Produce a summary table and caterpillar plot for the the \(\alpha\)’s and tests for convergence including the effective sample size.


  1. How is the model able to estimate intercepts for sites where there is only a single \(x\) value, or even sites where there is only a single observation at all?



Visualizing model predictions

pred_np <- MCMCpstr(zc.nopooled, params = "log_mu_site_pred", func = function(x) quantile(x, c(.025, .5, .975)))
pred_np_df <- adply(pred_np$log_mu_site_pred, c(1, 2))[, 2:5]
names(pred_np_df) <- c("group.index", "low", "mid", "high")
log.n.input.pred = rep(log(n.input.pred), n.sites)
pred_np_df <- cbind(log.n.input.pred, pred_np_df)
g2 +
  geom_line(data = pred_np_df, aes(x = log.n.input.pred, y = mid)) +
  geom_ribbon(data = pred_np_df, aes(x = log.n.input.pred, ymin = low, ymax = high), alpha = 0.2, fill = "yellow") +
  facet_wrap(~group.index)

Random Intercepts


Diagramming and writing the random intercepts model

So far we have either ignored the effect of site (the pooled model) or treated all sites as independent from one another (the no-pooled model). This time we are going to treat the sites as partially pooled, meaning we model the intercepts of the model as coming from a common distribution. In other words, treat the intercept in your model as a group level effect (aka, random effect). The deterministic model of N2O emissions remains a linearized power function, but two subscripts are required: \(i\) indexes the measurement within sites and \(j\) indexes site. Assume that the intercepts are drawn from a distribution with mean \(\mu_{\alpha}\) and variance \(\varsigma_{\alpha}^2\).

  1. Draw a Bayesian network and write out the posterior and joint distribution for a linear regression model of N2O emission (\(y\)) on fertilizer addition (\(x\)) for a single observation.


  1. For the joint distribution, start by using generic \([\,]\). Use \(\sigma^{2}\) to represent the uncertainty in your model realizing, of course, that you might need moment matching when you choose a specific distribution.


  1. Finish by choosing specific distributions for likelihoods and priors. You will use the math in the answer as a template to code your model in the subsequent exercises.



Fitting the model with JAGS

Now you will implement the model that allows intercept to vary by group, where each intercept is drawn from a common distribution. Again, use the template provided below to allow you to concentrate on writing JAGS code for the model. Note that you must use the index trick covered in lecture to align the different groups with different intercepts. Here are the preliminaries to set up the model:

n.input.pred <- seq(min(N2OEmission$n.input), max(N2OEmission$n.input), 10)
n.sites <- length(unique(N2OEmission$group.index))

data = list(
  log.emission = log(N2OEmission$emission),
  log.n.input.centered = log(N2OEmission$n.input) - mean(log(N2OEmission$n.input)),
  log.n.input.centered.pred = log(n.input.pred) - mean(log(N2OEmission$n.input)),
  group = N2OEmission$group.index,
  n.sites = n.sites)

inits = list(
  list(alpha = rep(0, n.sites), beta = .5, sigma = 50, mu.alpha= 0, sigma.alpha = 10),
  list(alpha = rep(1, n.sites), beta = 1.5, sigma = 10, mu.alpha= 2, sigma.alpha = 20),
  list(alpha = rep(-1, n.sites), beta = .75, sigma = 20, mu.alpha= -1, sigma.alpha = 12))
  1. Write the code for the model. Compile the model and execute the MCMC to produce a coda object. Produce trace plots of the chains for model parameters. Produce a summary table and caterpillar for the parameters and tests for convergence including the effective sample size.



Visualizing model predictions

Use the code from the pooled model to visualize the model predictions again.


  1. Why differ?

Diagramming and writing the random intercepts group effect model

In the previous example, we assumed that there was variation the intercept that was attributable to spatial variation among sites. We did not try to explain that variation, we simply acknowledged that it exists. Now we are going to “model a parameter” using data at the group-level to explain variation in the intercepts among sites. Modify the previous model to represent the effect of soil carbon on the intercept using the deterministic model. Here, we logit transform the carbon data to “spread them out” mapping 0-1 to all real numbers.

\[g_2\big(\kappa,\eta,\text{logit}(w_j)\big) =\kappa + \eta \text{logit}w_j\] to predict \(\alpha_j\).

  1. Draw a Bayesian network and write out the posterior and joint distributions for the full dataset, which means you must include the proper products. Use \(ij\) notation. You will need to notate that there are \(n_{j}\) measurements of {\(N\textsubscript{2}O\) emissions paired with fertilizer additions} from study \(j\). Chose appropriate distributions for each random variable.

  2. Draw a Bayesian network for a single observation from this model.


  1. Write out the posterior and joint distributions for the full dataset, which means you must include the proper products. Use \(ij\) notation. You will need to notate that there are \(n_{j}\) measurements of N2O emissions paired with fertilizer additions from study \(j\). Chose appropriate distributions for each random variable.



Fitting the model with JAGS

Modify your random intercepts model to implement the model you just developed that include a covariate at the site level soil carbon content.

n.input.pred <- seq(min(N2OEmission$n.input), max(N2OEmission$n.input), 10)
n.sites <- length(unique(N2OEmission$group.index))

data = list(
  log.emission = log(N2OEmission$emission),
  log.n.input.centered = log(N2OEmission$n.input) - mean(log(N2OEmission$n.input)),
  log.n.input.centered.pred = log(n.input.pred) - mean(log(N2OEmission$n.input)),
  w = log(SiteCarbon$mean / (100 - SiteCarbon$mean)), 
  group = N2OEmission$group.index,
  n.sites = n.sites)

inits = list(
  list(alpha = rep(0, n.sites), beta = .5, sigma = 50, sigma.alpha = 10, eta = .2, kappa = .5),
  list(alpha = rep(1, n.sites), beta = 1.5, sigma = 10, sigma.alpha = 20, eta = 3, kappa = .7),
  list(alpha = rep(-1, n.sites), beta = .75, sigma = 20, sigma.alpha = 12, eta = .1, kappa = .3))
  1. Write the code for the model. Compile the model and execute the MCMC to produce a coda object. Produce trace plots of the chains for model parameters. Produce a summary table and caterpillar for the parameters and tests for convergence including the effective sample size.



Visualizing model predictions

Use the code from the pooled model to visualize the model predictions again.



Random Coefficients


Model checking